Optimal searching of gapped repeats in a word


Maxime Crochemore* Roman Kolpakov^
Abstract

Following (Kolpakov et al., 2013; Gawrychowski and Manea, 2015),
we continue the study of $\alpha$-gapped repeats in strings, defined as
factors $uvu$ with $|uv| \le \alpha|u|$. Our main result is the
$O(\alpha n)$ bound on the number of maximal $\alpha$-gapped repeats in
a string of length $n$, previously proved to be $O(\alpha^2 n)$ in
(Kolpakov et al., 2013). For a closely related notion of maximal
$\delta$-subrepetition (maximal factors of exponent between $1+\delta$
and $2$), our result implies the $O(n/\delta)$ bound on their number,
which improves the bound of (Kolpakov et al., 2010) by a $\log n$
factor.

We also prove an algorithmic time bound $O(\alpha n + S)$ ($S$ is the
size of the output) for computing all maximal $\alpha$-gapped repeats.
Our solution, inspired by (Gawrychowski and Manea, 2015), is different
from the recently published proof by (Tanimura et al., 2015) of the
same bound. Together with our bound on $S$, this implies an
$O(\alpha n)$-time algorithm for computing all maximal $\alpha$-gapped
repeats.


1 Introduction 

Notation and basic definitions. Let $w = w[1]w[2] \ldots w[n] = w[1..n]$
be an arbitrary word. The length $n$ of $w$ is denoted by $|w|$. For any
$1 \le i \le j \le n$, word $w[i] \ldots w[j]$ is called a factor of $w$
and is denoted by $w[i..j]$. Note that notation $w[i..j]$ denotes two
entities: a word and its occurrence starting at position $i$ in $w$. To
underline the second meaning, we will sometimes use the term segment.
Speaking about the equality between factors can also be ambiguous, as it
may mean that the factors are identical words or identical segments. If
two factors $u$, $v$ are identical words, we call them equal and denote
this by $u = v$. To express that $u$ and $v$ are the same segment, we
use the notation $u \equiv v$. For any $i = 1, \ldots, n$, factor
$w[1..i]$ (resp. $w[i..n]$) is a prefix (resp. suffix) of $w$. By
positions on $w$ we mean indices $1, 2, \ldots, n$ of letters in $w$.
For any factor $v = w[i..j]$ of $w$, positions $i$ and $j$ are called
respectively start position and end position of $v$ and denoted by
$beg(v)$ and $end(v)$ respectively. Let $u$, $v$ be two factors of $w$.
Factor $u$ is contained in $v$ iff $beg(v) \le beg(u)$ and
$end(u) \le end(v)$. Letter $w[i]$ is contained in $v$ iff
$beg(v) \le i \le end(v)$.

A positive integer $p$ is called a period of $w$ if $w[i] = w[i + p]$
for each $i = 1, \ldots, n - p$. We denote by $per(w)$ the smallest
period of $w$ and define the exponent of $w$ as $exp(w) = |w|/per(w)$.
A word is called periodic if its exponent is at least 2. Occurrences of
periodic words are called repetitions.
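
As a small illustration of these definitions, the following sketch
computes the smallest period and the exponent of a nonempty word by
brute force. It is not part of the paper; the function names
`smallest_period`, `exponent` and `is_periodic` are hypothetical, and the
quadratic running time is chosen for clarity only.

```python
from fractions import Fraction


def smallest_period(w: str) -> int:
    """Return per(w) for a nonempty word w (0-based indexing internally)."""
    n = len(w)
    for p in range(1, n + 1):
        # p is a period iff w[i] == w[i + p] for every valid i
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    raise ValueError("w must be nonempty")


def exponent(w: str) -> Fraction:
    """Return exp(w) = |w| / per(w) as an exact fraction."""
    return Fraction(len(w), smallest_period(w))


def is_periodic(w: str) -> bool:
    """A word is periodic iff its exponent is at least 2."""
    return exponent(w) >= 2
```

For instance, `ababa` has smallest period 2 and exponent 5/2, so it is
periodic, whereas `abacaaba` has smallest period 5 and exponent 8/5.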

Repetitions, squares, runs. Patterns in strings formed by repeated
factors are of primary importance in word combinatorics [22] as well as
in various applications such as string matching algorithms mu, molecular
biology [S], or text compression [21]. The simplest and best known
example of such patterns is a factor of the form $uu$, where $u$ is a
nonempty word. Such repetitions are called squares. Squares have been
extensively studied. While the number of all square occurrences can be
quadratic (consider word $a^n$), it is known that the number of
primitively-rooted squares is $O(n \log n)$ [2], where a square $uu$ is
primitively-rooted if the exponent of $u$ is not an integer greater
than 1. An optimal $O(n \log n)$-time algorithm for finding all
primitively-rooted squares was proposed in [5].

Repetitions can be seen as a natural generalization of squares. A
repetition in a given word is called maximal if it cannot be extended by
at least one letter to the left nor to the right without changing
(increasing) its minimal period. More precisely, a repetition
$r = w[i..j]$ in $w$ is called maximal if it satisfies the following
conditions:

1. $w[i - 1] \ne w[i - 1 + per(r)]$ if $i > 1$,

2. $w[j + 1 - per(r)] \ne w[j + 1]$ if $j < n$.

For example, word cababaaa has two maximal repetitions: ababa and aaa.
Maximal repetitions are usually called runs in the literature. Since any
repetition is contained in some run, the set of all runs can be
considered as a compact encoding of all repetitions in the word. This
set has many useful applications, see, e.g., [7]. For any word $w$, we
denote by $\mathcal{R}(w)$ the number of maximal repetitions in $w$ and
by $\mathcal{E}(w)$ the sum of exponents of all maximal repetitions in
$w$. The following statements are proved in [TB].
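
The definition of a run can be checked on the example cababaaa with a
naive enumeration. The sketch below is an illustration only: it runs in
polynomial (not linear) time, it is not one of the algorithms cited in
this paper, and the names `smallest_period` and `maximal_repetitions`
are hypothetical.

```python
def smallest_period(s: str) -> int:
    """Smallest p >= 1 such that s[i] == s[i + p] for all valid i."""
    n = len(s)
    for p in range(1, n + 1):
        if all(s[i] == s[i + p] for i in range(n - p)):
            return p
    raise ValueError("s must be nonempty")


def maximal_repetitions(w: str):
    """Return all runs of w as sorted triples (beg, end, per), 1-based."""
    n = len(w)
    runs = set()
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            j = i
            # extend the stretch of matches w[t] == w[t + p]
            while j + p < n and w[j] == w[j + p]:
                j += 1
            # w[i : j + p] has period p and cannot be extended with period p
            segment = w[i:j + p]
            if len(segment) >= 2 * p and smallest_period(segment) == p:
                runs.add((i + 1, j + p, p))
            i = j + 1
    return sorted(runs)
```

On the word cababaaa this returns the two segments ababa (positions 2
to 6, period 2) and aaa (positions 6 to 8, period 1).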

Theorem 1 $\max_{|w|=n} \mathcal{E}(w) = O(n)$.

Corollary 1 $\max_{|w|=n} \mathcal{R}(w) = O(n)$.

A series of papers (e.g., B 0) focused on more precise upper bounds on
$\mathcal{E}(w)$ and $\mathcal{R}(w)$, trying to obtain the best
possible constant factor behind the $O$-notation. A breakthrough in this
direction was recently made in [2] where the so-called "runs conjecture"
$\mathcal{R}(w[1..n]) < n$ was proved. To the best of our knowledge, the
currently best upper bound on $\mathcal{R}(w)$ is shown in mi-

On the algorithmic side, an $O(n)$-time algorithm for finding all runs
in a word of length $n$ was proposed in |T6] for the case of
constant-size alphabet. Another $O(n)$-time algorithm, based on a
different approach, has been proposed in [2]. The $O(n)$ time bound
holds for the (polynomially-bounded) integer alphabet as well, see,
e.g., [2]. However, for the case of unbounded-size alphabet where
characters can only be tested for equality, the lower bound
$\Omega(n \log n)$ on computing all runs has been known for a long time
[23]. It is an interesting open question (raised over 20 years ago in
[3]) whether the $O(n)$ bound holds for an unbounded linearly-ordered
alphabet. Some results related to this question have recently been
obtained in [21].

Gapped repeats and subrepetitions. Another natural generalization of
squares are factors of the form $uvu$ where $u$ and $v$ are nonempty
words. We call such factors gapped repeats. For a gapped repeat $uvu$,
the left (resp. right) occurrence of $u$ is called the left (resp.
right) copy, and $v$ is called the gap. The period of this gapped repeat
is $|u| + |v|$. For a gapped repeat $\pi$, we denote the length of
copies of $\pi$ by $c(\pi)$ and the period of $\pi$ by $p(\pi)$. Note
that a gapped repeat $\pi = uvu$ may have different periods, and
$per(\pi) \le p(\pi)$. For example, in string cabacaabaa, segment
abacaaba corresponds to two gapped repeats having copies a and aba and
periods 7 and 5 respectively. Gapped repeats forming the same segment
but having different periods are considered distinct. This means that to
specify a gapped repeat it is generally not sufficient to specify its
segment. If $u'$, $u''$ are equal non-overlapping factors and $u'$
occurs to the left of $u''$, then by $(u', u'')$ we denote the gapped
repeat with left copy $u'$ and right copy $u''$. For a given gapped
repeat $(u', u'')$, equal factors $u'[i..j]$ and $u''[i..j]$, for
$1 \le i \le j \le |u'|$, of the copies $u'$, $u''$ are called
corresponding factors of repeat $(u', u'')$.

For any real $\alpha > 1$, a gapped repeat $\pi$ is called
$\alpha$-gapped if $p(\pi) \le \alpha c(\pi)$. Maximality of gapped
repeats is defined similarly to repetitions. A gapped repeat
$(w[i'..j'], w[i''..j''])$ in $w$ is called maximal if it satisfies the
following conditions:

1. $w[i' - 1] \ne w[i'' - 1]$ if $i' > 1$,

2. $w[j' + 1] \ne w[j'' + 1]$ if $j'' < n$.

In other words, a gapped repeat $\pi$ is maximal if its copies cannot be
extended to the left nor to the right by at least one letter without
breaking its period $p(\pi)$. As observed in [12], any $\alpha$-gapped
repeat is contained either in a (unique) maximal $\alpha$-gapped repeat
with the same period, or in a (unique) maximal repetition with a period
which is a divisor of the repeat's period. For example, in the above
string cabacaabaa, gapped repeat (ab)aca(ab) is contained in maximal
repeat (aba)ca(aba) with the same period 5. In string cabaaabaaa, gapped
repeat (ab)aa(ab) with period 4 is contained in maximal repetition
abaaabaaa with period 4. Since all maximal repetitions can be computed
efficiently in $O(n)$ time (see above), the problem of computing all
$\alpha$-gapped repeats in a word can be reduced to the problem of
finding all maximal $\alpha$-gapped repeats.
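
The three notions just introduced (gapped repeat, $\alpha$-gapped,
maximal) can be written down as simple predicates. The sketch below is
only an illustration of the definitions: a repeat is assumed to be given
by the 1-based start positions `i1`, `i2` of its copies and the copy
length `c`, and all function names are hypothetical.

```python
def is_gapped_repeat(w: str, i1: int, i2: int, c: int) -> bool:
    """Copies w[i1..i1+c-1] and w[i2..i2+c-1] are equal, with a nonempty gap."""
    n = len(w)
    if not (i1 >= 1 and c >= 1 and i1 + c < i2 and i2 + c - 1 <= n):
        return False
    return w[i1 - 1:i1 - 1 + c] == w[i2 - 1:i2 - 1 + c]


def is_alpha_gapped(i1: int, i2: int, c: int, alpha: float) -> bool:
    """The period p = i2 - i1 satisfies p <= alpha * c."""
    return (i2 - i1) <= alpha * c


def is_maximal(w: str, i1: int, i2: int, c: int) -> bool:
    """Neither a left nor a right extension of both copies by one letter matches."""
    n = len(w)
    j1, j2 = i1 + c - 1, i2 + c - 1          # 1-based end positions
    left_ok = i1 == 1 or w[i1 - 2] != w[i2 - 2]
    right_ok = j2 == n or w[j1] != w[j2]     # w[j1] is letter j1 + 1 (1-based)
    return left_ok and right_ok
```

On cabacaabaa, the repeat (aba)ca(aba) corresponds to `(i1, i2, c) =
(2, 7, 3)` and is maximal, while (ab)aca(ab), that is `(2, 7, 2)`, is a
gapped repeat with the same period 5 but is not maximal.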

Several variants of the problem of computing gapped repeats have been
studied earlier. In [1], it was shown that all maximal gapped repeats
with a gap length belonging to a specified interval can be found in time
$O(n \log n + S)$, where $n$ is the word length and $S$ is output size.
In [20], an algorithm was proposed for finding all gapped repeats with a
fixed gap length $d$ running in time $O(n \log d + S)$. In [T2], it was
proved that the number of maximal $\alpha$-gapped repeats in a word of
length $n$ is bounded by $O(\alpha^2 n)$ and all maximal $\alpha$-gapped
repeats can be found in $O(\alpha^2 n)$ time for the case of integer
alphabet. A new approach to computing gapped repeats was recently
proposed in [13, 10]. In particular, in [13] it is shown that the
longest $\alpha$-gapped repeat in a word of length $n$ over an integer
alphabet can be found in $O(\alpha n)$ time. Finally, in a recent paper
[25], an algorithm is proposed for finding all maximal $\alpha$-gapped
repeats in $O(\alpha n + S)$ time where $S$ is the output size, for a
constant-size alphabet. The algorithm uses an approach previously
introduced in [T].

Recall that repetitions are segments with exponent at least 2. Another
way to approach gapped repeats is to consider segments with exponent
smaller than 2, but strictly greater than 1. Clearly, such a segment
corresponds to a gapped repeat $\pi = uvu$ with
$per(\pi) = p(\pi) = |u| + |v|$. We will call such factors (segments)
subrepetitions. More precisely, for any $\delta$, $0 < \delta < 1$, by a
$\delta$-subrepetition we mean a factor $v$ that satisfies
$1 + \delta \le exp(v) < 2$. Again, the notion of maximality
straightforwardly applies to subrepetitions as well: maximal
subrepetitions are defined exactly in the same way as maximal
repetitions. The relationship between maximal subrepetitions and maximal
gapped repeats was clarified in |T9]. Directly from the definitions, a
maximal subrepetition $\pi$ in a string $w$ corresponds to a maximal
gapped repeat with $p(\pi) = per(\pi)$. Furthermore, a maximal
$\delta$-subrepetition corresponds to a maximal $1/\delta$-gapped
repeat. However, there may be more maximal $1/\delta$-gapped repeats
than maximal $\delta$-subrepetitions, as not every maximal
$1/\delta$-gapped repeat corresponds to a maximal
$\delta$-subrepetition.

Some combinatorial results on the number of maximal subrepetitions in
a string were obtained in [18]. In particular, it was proved that the
number of maximal $\delta$-subrepetitions in a word of length $n$ is
bounded by $O(\frac{n}{\delta} \log n)$. In [T^, an $O(n/\delta^2)$
bound on the number of maximal $\delta$-subrepetitions in a word of
length $n$ was obtained. Moreover, in [19], two algorithms were
proposed for finding all maximal $\delta$-subrepetitions in the word
running respectively in time and in 0(nlog?7. -|- ^ log |) expected
time, over the integer alphabet. In [T], it is shown that all
subrepetitions with the largest exponent (over all subrepetitions) can
be found in an overlap-free string in time $O(n)$, for a constant-size
alphabet.

Our results. In the present work we improve the results of [IH] on
maximal gapped repeats: we prove an asymptotically tight bound of
$O(\alpha n)$ on the number of maximal $\alpha$-gapped repeats in a word
of length $n$ (Section 3). From our bound, we also derive an
$O(n/\delta)$ bound on the number of maximal $\delta$-subrepetitions
occurring in the word, which improves the bound of [IH] by a $\log n$
factor. Then, based on the algorithm of [I3], we obtain an
asymptotically optimal $O(\alpha n)$ time bound for computing all
maximal $\alpha$-gapped repeats in a string (Section 4). Note that this
bound follows from the recently published paper [25] that presents an
$O(\alpha n + S)$ algorithm for computing all maximal $\alpha$-gapped
repeats. Here we present an alternative algorithm with the same bound
that we obtained independently.

2 Preliminaries 

In this section we state a few propositions that will be used later in
the paper. The following fact is well-known (see, e.g., m Proposition 2]).

Proposition 1 Any period $p$ of a word $v$ such that $|v| \ge 2p$ is
divisible by $per(v)$, the smallest period of $v$.

Let $\Delta$ be some natural number. A period $p$ of some word $v$ is
called $\Delta$-period if $p$ is divisible by $\Delta$. The minimal
$\Delta$-period of $v$, if it exists, is denoted by $p_\Delta(v)$. The
word $v$ is called $\Delta$-periodic if $|v| \ge 2p_\Delta(v)$. It is
obvious that any $\Delta$-periodic word is also periodic. Proposition 1
can be generalized in the following way.

Proposition 2 Any $\Delta$-period $p$ of a word $v$ such that
$|v| \ge 2p$ is divisible by $p_\Delta(v)$.

Proof. By Proposition 1, period $p$ is divisible by $per(v)$, so $p$ is
divisible by $LCM(per(v), \Delta)$. On the other hand,
$LCM(per(v), \Delta)$ is a $\Delta$-period of $v$. Thus,
$p_\Delta(v) = LCM(per(v), \Delta)$, and $p$ is divisible by
$p_\Delta(v)$. ■
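
The identity used in this proof, namely that the minimal
$\Delta$-period of a $\Delta$-periodic word equals
$LCM(per(v), \Delta)$, can be sanity-checked by brute force. The sketch
below is illustrative only and the helper names are hypothetical.

```python
from math import gcd


def periods(v: str):
    """All periods p (1 <= p <= |v|) of v, in increasing order."""
    n = len(v)
    return [p for p in range(1, n + 1)
            if all(v[i] == v[i + p] for i in range(n - p))]


def min_delta_period(v: str, delta: int):
    """Smallest period of v divisible by delta, or None if there is none."""
    for p in periods(v):
        if p % delta == 0:
            return p
    return None


def lcm(a: int, b: int) -> int:
    return a * b // gcd(a, b)


v = "ab" * 6                      # per(v) = 2, |v| = 12
p3 = min_delta_period(v, 3)       # minimal 3-period of v
# v is 3-periodic (|v| >= 2 * p3), so Proposition 2 applies
is_3_periodic = len(v) >= 2 * p3
```

Here `p3` equals `lcm(2, 3) = 6`, and every 3-period `p` of `v` with
`len(v) >= 2 * p` is a multiple of 6, in line with Proposition 2.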

Consider an arbitrary word $w = w[1..n]$ of length $n$. Recall that any
repetition $y$ in $w$ is extended to a unique maximal repetition $r$
with the same minimal period. We call $r$ the extension of $y$.

Let $r$ be a repetition in the word $w$. We call any factor of $w$ of
length $per(r)$ which is contained in $r$ a cyclic root of $r$. For
cyclic roots we have the following property proved, e.g., in
[19, Proposition 2].

Proposition 3 Two cyclic roots $u'$, $u''$ of a repetition $r$ are equal
if and only if $beg(u') \equiv beg(u'') \pmod{per(r)}$.

3 Number of maximal repeats and subrepetitions

In this section, we obtain an improved upper bound on the number of
maximal gapped repeats and subrepetitions in a string $w$. Following the
general approach of [IH], we split all maximal gapped repeats into three
categories according to periodicity properties of the repeat's copies:
periodic, semiperiodic and ordinary repeats. Bounds for periodic and
semiperiodic repeats are directly borrowed from [19], while for ordinary
repeats, we obtain a better bound.

Periodic repeats. We say that a maximal gapped repeat is periodic if its
copies are periodic strings (i.e. of exponent at least 2). The set of
all periodic maximal $\alpha$-gapped repeats in $w$ is denoted by
$\mathcal{PP}_\alpha$. The following bound on the size of
$\mathcal{PP}_\alpha$ was obtained in [19, Corollary 6].

Lemma 1 $|\mathcal{PP}_k| = O(kn)$ for any natural $k > 1$.

Semiperiodic repeats. A maximal gapped repeat is called prefix (suffix)
semiperiodic if the copies of this repeat are not periodic, but have a
prefix (suffix) which is periodic and whose length is at least half of
the copy length. A maximal gapped repeat is semiperiodic if it is either
prefix or suffix semiperiodic. The set of all semiperiodic
$\alpha$-gapped maximal repeats is denoted by $\mathcal{SP}_\alpha$. In
[13, Corollary 8], the following bound was obtained on the number of
semiperiodic maximal $\alpha$-gapped repeats.

Lemma 2 ( [TT> j) $|\mathcal{SP}_k| = O(kn)$ for any natural $k > 1$.

Ordinary repeats. Maximal gapped repeats which are neither periodic nor
semiperiodic are called ordinary. The set of all ordinary maximal
$\alpha$-gapped repeats in the word $w$ is denoted by
$\mathcal{OP}_\alpha$. In the rest of this section, we prove that the
cardinality of $\mathcal{OP}_\alpha$ is $O(\alpha n)$. For simplicity,
assume that $\alpha$ is an integer number $k$.

To estimate the number of ordinary maximal $k$-gapped repeats, we use
the following idea from [15]. We represent a maximal repeat
$\pi = (u', u'')$ from $\mathcal{OP}_k$ by a triple $(i, j, c)$ where
$i = beg(u')$, $j = beg(u'')$ and $c = c(\pi) = |u'| = |u''|$. Such
triples will be called points. Obviously, $\pi$ is uniquely defined by
values $i$, $j$ and $c$, therefore two different repeats from
$\mathcal{OP}_k$ cannot be represented by the same point.

For any two points $(i', j', c')$, $(i'', j'', c'')$ we say that point
$(i', j', c')$ covers point $(i'', j'', c'')$ if
$i' \le i'' \le i' + c'/6$, $j' \le j'' \le j' + c'/6$ and
$c' \ge c'' \ge \frac{2c'}{3}$. A point is covered by a repeat $\pi$ if
it is covered by the point representing $\pi$. By $\mathcal{V}[\pi]$ we
denote the set of all points covered by a repeat $\pi$. We show that no
point can be covered by two different repeats from $\mathcal{OP}_k$.
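
The covering relation is a plain conjunction of three range conditions
and can be written as a predicate. The sketch below assumes the
constants $1/6$ and $2/3$ as read from the definition above, uses
integer arithmetic to avoid rounding, and the name `covers` is
hypothetical.

```python
def covers(big, small) -> bool:
    """True iff point big = (i1, j1, c1) covers point small = (i2, j2, c2)."""
    i1, j1, c1 = big
    i2, j2, c2 = small
    return (
        0 <= i2 - i1 and 6 * (i2 - i1) <= c1      # i1 <= i2 <= i1 + c1/6
        and 0 <= j2 - j1 and 6 * (j2 - j1) <= c1  # j1 <= j2 <= j1 + c1/6
        and 2 * c1 <= 3 * c2 and c2 <= c1         # 2*c1/3 <= c2 <= c1
    )
```

For example, the point `(10, 40, 12)` covers `(11, 42, 9)`, but it
covers neither `(13, 40, 12)` (first coordinate too far to the right)
nor `(10, 40, 7)` (third coordinate below two thirds of 12).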

Lemma 3 Two different repeats from $\mathcal{OP}_k$ cannot cover the
same point.

Proof. Let $\pi_1 = (u'_1, u''_1)$, $\pi_2 = (u'_2, u''_2)$ be two
different repeats from $\mathcal{OP}_k$ covering the same point
$(i, j, c)$. Denote $c_1 = c(\pi_1)$, $c_2 = c(\pi_2)$,
$p_1 = p(\pi_1)$, $p_2 = p(\pi_2)$. Without loss of generality we assume
$c_1 \ge c_2$. From $c_1 \ge c \ge \frac{2c_1}{3}$ and
$c_2 \ge c \ge \frac{2c_2}{3}$ we have
$c_1 \ge c_2 \ge \frac{2c_1}{3}$, i.e. $c_2 \le c_1 \le \frac{3c_2}{2}$.
Note that $w[i]$ is contained in both left copies $u'_1$, $u'_2$, i.e.
these copies overlap. If $p_1 = p_2$, then repeats $\pi_1$ and $\pi_2$
must coincide due to the maximality of these repeats. Thus,
$p_1 \ne p_2$. Denote $\Delta = |p_1 - p_2| > 0$. From
$beg(u'_1) \le i \le beg(u'_1) + c_1/6$ and
$beg(u''_1) \le j \le beg(u''_1) + c_1/6$ we have

$$(j - i) - c_1/6 \le p_1 \le (j - i) + c_1/6.$$

Analogously, we have

$$(j - i) - c_2/6 \le p_2 \le (j - i) + c_2/6.$$

Thus $\Delta \le (c_1 + c_2)/6$ which, together with inequality
$c_1 \le \frac{3c_2}{2}$, implies

$$\Delta \le \frac{5c_2}{12}.$$

First consider the case when one of the copies $u'_1$, $u'_2$ is
contained in the other, i.e. $u'_2$ is contained in $u'_1$. In this
case, $u''_1$ contains some factor $\hat{u}''_2$ corresponding to the
factor $u'_2$ in $u'_1$. Since $beg(u''_2) - beg(u'_2) = p_2$,
$beg(\hat{u}''_2) - beg(u'_2) = p_1$ and
$\hat{u}''_2 = u''_2 = u'_2$, we have

$$|beg(u''_2) - beg(\hat{u}''_2)| = \Delta,$$

so $\Delta$ is a period of $u'_2$ such that
$\Delta \le \frac{5}{12} c_2 = \frac{5}{12}|u'_2|$. Thus, $u'_2$ is
periodic, which contradicts that $\pi_2$ is not periodic.

Now consider the case when $u'_1$, $u'_2$ are not contained in one
another. Denote by $z'$ the overlap of $u'_1$ and $u'_2$. Let $z'$ be a
suffix of $u'_k$ and a prefix of $u'_l$ where $k, l = 1, 2$, $k \ne l$.
Then $u''_k$ contains a suffix $z''_k$ corresponding to the suffix $z'$
in $u'_k$, and $u''_l$ contains a prefix $z''_l$ corresponding to the
prefix $z'$ in $u'_l$. Since $beg(z''_k) - beg(z') = p_k$ and
$beg(z''_l) - beg(z') = p_l$ and $z''_k = z''_l = z'$, we have

$$|beg(z''_k) - beg(z''_l)| = |p_k - p_l| = \Delta,$$

therefore $\Delta$ is a period of $z'$. Note that in this case

$$beg(u'_k) < beg(u'_l) \le i \le beg(u'_k) + c_k/6,$$

therefore $0 < beg(u'_l) - beg(u'_k) \le c_k/6$. Thus

$$|z'| = c_k - (beg(u'_l) - beg(u'_k)) \ge \frac{5}{6} c_k \ge \frac{5}{6} c_2.$$

From $\Delta \le \frac{5}{12} c_2$ and $c_2 \le \frac{6}{5}|z'|$ we
obtain $\Delta \le |z'|/2$. Thus, $z'$ is a periodic suffix of $u'_k$
such that $|z'| \ge \frac{1}{2}|u'_k|$, i.e. $\pi_k$ is either suffix
semiperiodic or periodic, which contradicts $\pi_k \in \mathcal{OP}_k$. ■

Denote by $Q_k$ the set of all points $(i, j, c)$ such that
$1 \le i, j, c \le n$ and $i < j \le i + (\frac{3}{2}k + \frac{1}{4})c$.

Lemma 4 Any point covered by a repeat from $\mathcal{OP}_k$ belongs to
$Q_k$.

Proof. Let a point $(i, j, c)$ be covered by some repeat
$\pi = (u', u'')$ from $\mathcal{OP}_k$. Denote $c' = c(\pi)$. Note that
$w[i]$ and $w[j]$ are contained respectively in $u'$ and $u''$ and
$n > c' \ge c \ge \frac{2c'}{3} > 0$, so inequalities
$1 \le i, j, c \le n$ and $i < j$ are obvious. Note also that

$$j \le beg(u'') + c'/6 = beg(u') + p(\pi) + c'/6 \le i + kc' + c'/6,$$

therefore, taking into account $c' \le \frac{3c}{2}$, we have
$j \le i + (\frac{3}{2}k + \frac{1}{4})c$. ■

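From Lemmas 3 and 4 we obtain

Lemma 5 $|\mathcal{OP}_k| = O(nk)$.

Proof. Assign to each point $(i, j, c)$ the weight
$\rho(i, j, c) = 1/c^3$. For any finite set $A$ of points, we define

$$\rho(A) = \sum_{(i,j,c) \in A} \rho(i, j, c) = \sum_{(i,j,c) \in A} \frac{1}{c^3}.$$

Let $\pi$ be an arbitrary repeat from $\mathcal{OP}_k$ represented by a
point $(i', j', c')$. Then

$$\rho(\mathcal{V}[\pi]) = \sum_{i' \le i \le i' + c'/6} \; \sum_{j' \le j \le j' + c'/6} \; \sum_{2c'/3 \le c \le c'} \frac{1}{c^3} > \frac{c'^2}{36} \sum_{2c'/3 \le c \le c'} \frac{1}{c^3}.$$

Using a standard estimation of sums by integrals, one can deduce that
$\sum_{2c'/3 \le c \le c'} \frac{1}{c^3} = \Omega(1/c'^2)$ for any $c'$.
Thus, for any $\pi$ from $\mathcal{OP}_k$,
$\rho(\mathcal{V}[\pi]) = \Omega(1)$. Therefore,

$$\sum_{\pi \in \mathcal{OP}_k} \rho(\mathcal{V}[\pi]) = \Omega(|\mathcal{OP}_k|). \qquad (1)$$

Note also that

$$\rho(Q_k) \le \sum_{i=1}^{n} \sum_{c=1}^{n} \left(\tfrac{3}{2}k + \tfrac{1}{4}\right) c \cdot \frac{1}{c^3} < 2nk \sum_{c=1}^{\infty} \frac{1}{c^2} = \frac{nk\pi^2}{3}.$$

Thus,

$$\rho(Q_k) = O(nk). \qquad (2)$$

By Lemma 4, any point covered by repeats from $\mathcal{OP}_k$ belongs
to $Q_k$. On the other hand, by Lemma 3, each point of $Q_k$ cannot be
covered by two repeats from $\mathcal{OP}_k$. Therefore,

$$\sum_{\pi \in \mathcal{OP}_k} \rho(\mathcal{V}[\pi]) \le \rho(Q_k).$$

Thus, using (1) and (2) we conclude that $|\mathcal{OP}_k| = O(nk)$. ■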

Putting together Lemma 1, Lemma 2 and Lemma 5, we obtain that for any
integer $k \ge 2$, the number of maximal $k$-gapped repeats in $w$ is
$O(nk)$. The bound straightforwardly generalizes to the case of real
$\alpha > 1$. Thus, we conclude with

Theorem 2 For any $\alpha > 1$, the number of maximal $\alpha$-gapped
repeats in $w$ is $O(\alpha n)$.

Note that the bound of Theorem 2 is asymptotically tight. To see this,
it is enough to consider word $w_k = (0110)^k$. It is easy to check that
for a big enough $\alpha$ and $k = \Omega(\alpha)$, $w_k$ contains
$\Theta(\alpha|w_k|)$ maximal $\alpha$-gapped repeats whose copies are
single-letter words.

We now use Theorem 2 to obtain an upper bound on the number of maximal
$\delta$-subrepetitions. The following proposition, shown in
[T2, Proposition 3], follows from the fact that each maximal
$\delta$-subrepetition defines at least one maximal $1/\delta$-gapped
repeat (cf. Introduction).

Proposition 4 ( |T5j ) For $0 < \delta < 1$, the number of maximal
$\delta$-subrepetitions in a string is no more than the number of
maximal $1/\delta$-gapped repeats.

Theorem 2 combined with Proposition 4 immediately implies the following
upper bound for maximal $\delta$-subrepetitions that improves the bound
of [18] by a $\log n$ factor.

Theorem 3 For $0 < \delta < 1$, the number of maximal
$\delta$-subrepetitions in $w$ is $O(n/\delta)$.

The $O(n/\delta)$ bound on the number of maximal
$\delta$-subrepetitions is asymptotically tight, at least on an
unbounded alphabet: word $ab_1ab_2 \ldots ab_k$ contains
$\Omega(n/\delta)$ maximal $\delta$-subrepetitions for
$\delta \le 1/2$.

4 Computing all maximal $\alpha$-gapped repeats

In this section, we present an $O(\alpha n + S)$ algorithm for computing
all maximal $\alpha$-gapped repeats in a word $w$. This bound has been
recently announced in [25]; here we present a different solution.
Together with the $O(\alpha n)$ bound of Theorem 2, this implies an
$O(\alpha n)$-time algorithm.

4.1 Computing PR-repeats 

Some maximal $\alpha$-gapped repeats can be specifically located, as
defined below, within maximal repetitions (runs). For example, word
cabababababaa contains maximal gapped repeats (a)babababab(a),
(aba)babab(aba) and (ababa)b(ababa) within the run
abababababa = $(ab)^{11/2}$. In this section, we describe the structure
of such repeats, and in particular those of them which are periodic (see
Section 3), like the repeat (ababa)b(ababa) above. We show how those
maximal $\alpha$-gapped repeats can be extracted from the runs. Repeats
which are located within runs but are not periodic will be found
separately, together with repeats (periodic or not) which are not
located within runs. This part will be described in the next section.

Let $\pi = (u', u'')$ be a periodic gapped repeat. If the extensions of
$u'$ and $u''$ are the same repetition $r$ then we say that $r$
generates $\pi$ and we call $\pi$ a PR-repeat (abbreviating from
Periodic Run-generated). Gapped repeats which are not PR-repeats are
called non-PR repeats. We will use the following fact.

Proposition 5 Let $\pi = (u', u'')$ be a maximal gapped repeat such that
its copies $u'$ and $u''$ contain a pair of corresponding factors having
the same extension $r$. Then $\pi$ is generated by $r$.

Proof. Observe that to prove the proposition, it is enough to show that
both copies $u'$ and $u''$ are contained in $r$, i.e.
$beg(r) \le beg(u')$ and $end(r) \ge end(u'')$. Let $beg(r) > beg(u')$.
Then both letters $w[beg(r) - 1]$ and $w[beg(r) - 1 + per(r)]$ are
contained in $u'$. Let these letters be respectively the $j$-th and
$(j + per(r))$-th letters of $u'$. Then we have
$u''[j] = u'[j] \ne u'[j + per(r)] = u''[j + per(r)]$, i.e.
$u''[j] \ne u''[j + per(r)]$, which is a contradiction to the fact that
both letters $u''[j]$ and $u''[j + per(r)]$ are contained in $r$.
Relation $end(r) \ge end(u'')$ is proved analogously. ■

All maximal PR-repeats can be easily computed according to the following
lemma.

Lemma 6 A maximal gapped periodic repeat $\pi = (u', u'')$ is generated
by a maximal repetition $r$ if and only if $p(\pi)$ is divisible by
$per(r)$ and

$$|r|/2 < p(\pi) \le |r| - 2per(r),$$
$$u' = w[beg(r) .. end(r) - p(\pi)],$$
$$u'' = w[beg(r) + p(\pi) .. end(r)].$$

Proof. Let $\pi$ be generated by $r$. Consider prefixes of $u'$ and
$u''$ of length $per(r)$. These prefixes are equal cyclic roots of $r$,
and by Proposition 3 the difference $beg(u'') - beg(u') = p(\pi)$ is
divisible by $per(r)$. Inequalities $|r|/2 < p(\pi) \le |r| - 2per(r)$
follow immediately from the definition of a repeat generated by a
repetition. To prove the last two conditions of the lemma, it is
sufficient to prove $beg(u') = beg(r)$ and $end(u'') = end(r)$. Let
$beg(u') \ne beg(r)$, i.e. $beg(u') > beg(r)$. Then both letters
$w[beg(u') - 1]$ and $w[beg(u'') - 1]$ are contained in $r$. Thus, since
the difference $(beg(u'') - 1) - (beg(u') - 1) = p(\pi)$ is divisible by
$per(r)$, we have $w[beg(u') - 1] = w[beg(u'') - 1]$, which contradicts
the maximality of $\pi$. The relation $end(u'') = end(r)$ is proved
analogously. Thus, all the conditions of the lemma are proved. On the
other hand, if $\pi$ satisfies all the conditions of the lemma then
$\pi$ is obviously generated by $r$. ■

Corollary 2 A maximal repetition $r$ generates no more than $exp(r)/2$
maximal PR-repeats, and all these repeats can be computed from $r$ in
$O(exp(r))$ time.

To find all maximal $\alpha$-gapped PR-repeats in a string $w$, we first
compute all maximal repetitions in $w$ in $O(n)$ time (see
Introduction). Then, for each maximal repetition $r$, we output all
maximal $\alpha$-gapped repeats generated by $r$. Using Corollary 2,
this can be done in $O(exp(r))$ time. Thus the total time of processing
all maximal repetitions is $O(\mathcal{E}(w))$. Since
$\mathcal{E}(w) = O(n)$ by Theorem 1, all maximal $\alpha$-gapped
PR-repeats in $w$ can be computed in $O(n)$ time.
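
The extraction step for a single run can be sketched as follows, under
the reading of Lemma 6 in which the period of the repeat is a multiple
$p$ of $per(r)$ with $|r|/2 < p \le |r| - 2per(r)$, the left copy starts
at $beg(r)$ and the right copy ends at $end(r)$. The function name and
the output format are hypothetical; the run itself is assumed to be
given by a separate runs algorithm.

```python
def pr_repeats_from_run(beg: int, end: int, per: int, alpha: float):
    """List the maximal alpha-gapped PR-repeats generated by a run.

    The run is w[beg..end] (1-based, inclusive) with smallest period per.
    Each result is (p, (u1_beg, u1_end), (u2_beg, u2_end)).
    """
    length = end - beg + 1
    result = []
    # candidate periods are the multiples of per up to |r| - 2*per
    for p in range(per, length - 2 * per + 1, per):
        copy_len = length - p
        if 2 * p > length and p <= alpha * copy_len:
            result.append((p, (beg, end - p), (beg + p, end)))
    return result
```

For the run abababababa occupying positions 2 to 12 of cabababababaa
(period 2), the only candidate period is 6, which yields the repeat
(ababa)b(ababa). The loop performs at most $exp(r)$ iterations, which
matches the time bound of Corollary 2.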

4.2 Computing non-PR repeats 

We now turn to the computation of maximal non-PR $\alpha$-gapped
repeats. Recall that non-PR repeats are those which are either
non-periodic, or periodic but not located within a single run. Our goal
is to show that all maximal non-PR $\alpha$-gapped repeats can be found
in $O(\alpha n)$ time. Observe that there exists a trivial algorithm for
computing all maximal $\alpha$-gapped repeats in $O(n^2)$ time that
proceeds as follows: for each period $p \le n$, find all maximal
$\alpha$-gapped repeats with period $p$ in $O(n)$ time by consecutively
comparing symbols $w[i]$ and $w[i + p]$ for $i = 1, 2, \ldots, n - p$.
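
The trivial quadratic algorithm can be sketched directly. In the sketch
below, a maximal stretch of positions with $w[i] = w[i + p]$ of length
$c$ is reported as a repeat only if $c < p$ (so that the gap is
nonempty) and $p \le \alpha c$. The function name and the output format
(1-based starts of both copies and the copy length) are hypothetical.

```python
def maximal_gapped_repeats(w: str, alpha: float):
    """Return all maximal alpha-gapped repeats of w as triples (i1, i2, c)."""
    n = len(w)
    out = []
    for p in range(1, n):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            j = i
            # maximal stretch of matches for period p starts at i, ends before j
            while j + p < n and w[j] == w[j + p]:
                j += 1
            c = j - i
            if c < p and p <= alpha * c:
                out.append((i + 1, i + 1 + p, c))
            i = j + 1
    return out
```

On cabacaabaa with $\alpha = 2$ the output contains `(2, 7, 3)`, i.e.
the maximal repeat (aba)ca(aba). On cabaaabaaa no repeat of period 4 is
reported, because the stretch of matches for period 4 has length 5 and
therefore forms a repetition rather than a gapped repeat.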

From the results of [1], it follows that all maximal $\alpha$-gapped
repeats can be found in time $O(n \log n + S)$. This, together with
Theorem 2, implies an $O(\alpha n)$-time algorithm for the case
$\alpha \ge \log n$. Therefore, we only have to consider the case
$\alpha < \log n$.

(i) Preliminaries 

Assume that $\alpha < \log n$. For this case, we proceed with a
modification of the algorithm of [TS]. We compute all maximal
$\alpha$-gapped non-PR repeats $\pi$ in $w$ such that
$c(\pi) \ge \log n$. To do this, we divide $w$ into blocks of
$\Delta = (\log n)/4$ consecutive symbols of $w$. Without loss of
generality, we assume that $n = 2^k\Delta$, i.e. $w$ contains exactly
$2^k$ blocks. A word $x$ of length $2^l\Delta$ where
$0 \le l \le k - 1$ is called a basic factor of $w$ if
$x = w[i\Delta + 1 .. (i + 2^l)\Delta]$ for some $i$. Such an occurrence
$w[i\Delta + 1 .. (i + 2^l)\Delta]$ of $x$ starting at a block frontier
will be called aligned. A basic factor $x$ of length $2^l\Delta$, where
$1 \le l \le k - 1$, is called superbasic if
$x = w[i2^l\Delta + 1 .. (i + 1)2^l\Delta]$ for some $i$.

Note that $w$ contains $O(n)$ aligned occurrences of basic factors and
$O(n/\log n)$ aligned occurrences of superbasic factors. Let
$z = w[q2^l\Delta + 1 .. (q + 1)2^l\Delta]$ be an aligned occurrence of
a superbasic factor of length $2^l\Delta$ in $w$. For
$\tau = 0, 1, \ldots, \Delta - 1$, an occurrence
$w[q2^l\Delta + 1 + \tau .. (q2^l + 2^{l-1})\Delta + \tau]$ of a basic
factor of length $2^{l-1}\Delta$ is called $\tau$-associated (or simply
associated) with $z$. Note that any basic factor occurrence
$\tau$-associated with $z$ is entirely contained in $z$ and is uniquely
defined by $z$ and $\tau$. Thus, $z$ has no more than $\Delta$
associated occurrences of basic factors.
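
To make the index arithmetic concrete, the following sketch lists the
segments $\tau$-associated with an aligned superbasic occurrence
$z = w[q2^l\Delta + 1 .. (q + 1)2^l\Delta]$, following the formula as
read above. Segments are 1-based inclusive pairs, and the function name
is hypothetical.

```python
def associated_occurrences(q: int, l: int, delta: int):
    """Return z and the list of its tau-associated segments, tau = 0..delta-1."""
    assert l >= 1 and delta >= 1 and q >= 0
    z = (q * 2**l * delta + 1, (q + 1) * 2**l * delta)
    half = 2**(l - 1) * delta          # length of the associated basic factor
    occ = [(z[0] + tau, z[0] + tau + half - 1) for tau in range(delta)]
    return z, occ
```

With `q = 1`, `l = 1` and `delta = 3`, the occurrence `z` is the segment
`(7, 12)` and its associated segments are `(7, 9)`, `(8, 10)` and
`(9, 11)`, each of length 3 and each contained in `z`.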

To continue, we need one more definition: for $1 \le i, j \le n$, denote
by $LCP(i, j)$ the length of the longest common prefix of $w[i..n]$ and
$w[j..n]$, and by $LCS(i, j)$ the length of the longest common suffix of
$w[1..i]$ and $w[1..j]$.

Let $\pi = (u', u'')$ be a maximal gapped repeat in $w$ such that
$c(\pi) \ge \log n = 4\Delta$. Note that in this case, the left copy
$u'$ contains at least one aligned occurrence of superbasic factors.
Consider aligned occurrences of superbasic factors of maximal length
contained in $u'$. Note that $u'$ can contain either one or two adjacent
such occurrences. Let $z$ be the leftmost of them. Note that in this
case, we have the following restrictions imposed on $u'$:

$$beg(z) - |z| < beg(u') \le beg(z),$$
$$end(z) \le end(u') < end(z) + 2|z|. \qquad (3)$$

Thus, $c(\pi) < 4|z|$. Consider factor $z''$ in $u''$ corresponding to
$z$ in $u'$. Note that $z''$ can be non-aligned. Consider in $z''$ the
leftmost aligned basic factor $y''$ of length $|z''|/2$. Observe that
$beg(z'') \le beg(y'') < beg(z'') + \Delta$ and $y''$ is entirely
contained in $z''$. Let $y'$ be the factor of $z$ corresponding to
factor $y''$ in $z''$. It is easily seen that $y'$ is an occurrence of a
basic factor associated with $z$, and $\pi$ is uniquely defined by $z$,
$y'$ and $y''$. Thus, any maximal gapped repeat $\pi$ such that
$c(\pi) \ge \log n$ is uniquely defined by a triple $(z, y', y'')$,
where $z$ is an aligned occurrence of some superbasic factor, $y'$ is an
occurrence of some basic factor associated with $z$, and $y''$ is an
aligned occurrence of the same basic factor. From now on, we will say in
such a case that $\pi$ is defined by the triple $(z, y', y'')$.

Observe that $\pi = (u', u'')$ can be retrieved from $(z, y', y'')$
using the LCP and LCS functions:

$$beg(u') = beg(y') - LCS(beg(y') - 1, beg(y'') - 1),$$
$$end(u') = end(y') + LCP(end(y') + 1, end(y'') + 1),$$
$$beg(u'') = beg(y'') - LCS(beg(y') - 1, beg(y'') - 1),$$
$$end(u'') = end(y'') + LCP(end(y') + 1, end(y'') + 1).$$
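
The retrieval of a maximal repeat from a pair of equal factors can be
sketched as follows: the copies are obtained by extending $y'$ and $y''$
to the left by the LCS value and to the right by the LCP value. The
paper answers LCP and LCS queries in constant time after preprocessing;
the naive linear-time scans below are a stand-in used only to keep the
example self-contained, and all function names are hypothetical.

```python
def lcp(w: str, i: int, j: int) -> int:
    """Longest common prefix length of w[i..n] and w[j..n] (1-based)."""
    n, k = len(w), 0
    while i + k <= n and j + k <= n and w[i + k - 1] == w[j + k - 1]:
        k += 1
    return k


def lcs(w: str, i: int, j: int) -> int:
    """Longest common suffix length of w[1..i] and w[1..j] (1-based)."""
    k = 0
    while i - k >= 1 and j - k >= 1 and w[i - k - 1] == w[j - k - 1]:
        k += 1
    return k


def extend_to_repeat(w: str, y1, y2):
    """Extend equal segments y1, y2 = (beg, end) to the copies (u', u'')."""
    (b1, e1), (b2, e2) = y1, y2
    left = lcs(w, b1 - 1, b2 - 1)
    right = lcp(w, e1 + 1, e2 + 1)
    return (b1 - left, e1 + right), (b2 - left, e2 + right)
```

For example, in cabacaabaa the two occurrences of the letter b at
positions 3 and 8 extend to the copies `(2, 4)` and `(7, 9)` of the
maximal repeat (aba)ca(aba).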
Assume additionally that $\pi$ is an $\alpha$-gapped repeat for
$\alpha > 1$. Then, taking into account inequalities (3) and
$c(\pi) < 4|z|$, we have

$$end(y'') \le end(u'') = end(u') + p(\pi) < end(z) + 2|z| + \alpha c(\pi)$$
$$< end(z) + 2|z| + 4\alpha|z| < end(z) + 6\alpha|z| = end(z) + 12\alpha|y''|.$$

On the other hand, $beg(y'') \ge beg(u'') > end(u') \ge end(z)$. Thus,
for any triple $(z, y', y'')$ defining a maximal $\alpha$-gapped repeat
in $w$, the occurrence $y''$ is contained in the segment
$w[end(z) + 1 .. end(z) + 12\alpha|y''|]$ of length $12\alpha|y''|$ to
the right of $z$. We will denote this segment by $I(z)$. The main idea
of the algorithm is to consider all triples $(z, y', y'')$ which can
define maximal $\alpha$-gapped non-PR repeats and for each such triple,
check if it actually defines one, which is then computed and output. All
the triples $(z, y', y'')$ are considered in a natural way: for each
aligned occurrence $z$ of a superbasic factor and each occurrence $y'$
of a basic factor associated with $z$, we consider all aligned
occurrences $y''$ of the same basic factor in the segment $I(z)$.

(ii) Naming basic factors on a suffix tree and computing their
associated occurrences

We now describe how this computation is implemented. First we construct
a suffix tree for the input string $w$. Suffix tree is a classical data
structure of size $O(n)$ which can be constructed in $O(n)$ time for a
word over a constant alphabet, see e.g. [H]. Using the suffix tree, we
can perform an $O(n)$-time preprocessing which allows to retrieve
$LCP(i, j)$ for any $i$, $j$ in constant time, see e.g. [14]. Similarly,
we preprocess $w$ to support $LCS(i, j)$ for any $i$, $j$ in constant
time. Then we compute all basic factors of $w$. This computation is
performed by naming all the basic factors, i.e. assigning to each
aligned occurrence of a basic factor a name of this factor. The most
convenient way to name basic factors is to assign to a basic factor $y$
of length $2^l\Delta$ a pair $(l, i)$, where $i$ is the start position
of the leftmost aligned occurrence of $y$ in $w$. Note that since we
have only $n/\Delta$ distinct start positions $i$, the size of the
two-dimensional array required for working with these pairs is $O(n)$.
To perform the required computation, we first mark in the suffix tree
each node labeled by a basic factor by the name of this factor (in the
case when this node is implicit we make it explicit). To this end, for
each node $v$ of the suffix tree we compute the value $minleaf(v)$ which
is the smallest leaf number divisible by $\Delta$ in the subtree rooted
in $v$, if such a number exists. This can be easily done in $O(n)$ time
by a bottom-up traversal of the tree. Then, each suffix tree edge
$(u, v)$ such that the string depth of $u$ is less than $2^l\Delta$, the
string depth of $v$ is not less than $2^l\Delta$, and $minleaf(v)$ is
defined is treated in the following way: if the string depth of $v$ is
$2^l\Delta$, node $v$ is marked by name $(l, minleaf(v))$, otherwise a
new node of string depth $2^l\Delta$ is created within edge $(u, v)$ and
marked by name $(l, minleaf(v))$. The obtained tree will be called
marked suffix tree. Since we have $O(n)$ distinct basic factors, the
marked suffix tree contains no more than $O(n)$ additionally inserted
nodes. Thus, this tree has $O(n)$ size and is constructed in $O(n)$
time.

To assign to each aligned occurrence $w[i \,..\, i+2^l-1]$ of a basic factor the
name of this factor, we perform a depth-first top-down traversal of the marked
suffix tree. During the traversal we maintain an auxiliary array
$\mathit{basancestor}$: at the first visit of a node marked by a name $(l,m)$ we set
$\mathit{basancestor}[l]$ to $m$, and at the second visit of this node we reset
$\mathit{basancestor}[l]$ to undefined. Whenever during the traversal we reach a leaf
$i$ divisible by $\Delta$, for each $l = 0, 1, \ldots, k-1$ we identify
$w[i \,..\, i+2^l-1]$ as an occurrence of the basic factor named by
$(l, \mathit{basancestor}[l])$. Note that this traversal is performed in $O(n)$
time.
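
The effect of the naming step can be illustrated by the following minimal
sketch. It is not the suffix-tree procedure described above: it replaces the
marked suffix tree by plain hashing of factor contents, so it does not achieve
the $O(n)$ bound. The function name, the convention that an occurrence is
aligned when its 1-based start position is divisible by `delta`, and the
explicit list `lengths` of basic factor lengths are assumptions made for the
illustration.

```python
def name_aligned_factors(w, delta, lengths):
    """Return a dict mapping (start, l) to the name (l, leftmost_start).

    Positions are 1-based as in the text. An occurrence is treated as
    aligned when its start position is divisible by delta (assumption).
    lengths[l] is the length of the basic factors of level l (assumption).
    """
    names = {}
    n = len(w)
    for l, length in enumerate(lengths):
        leftmost = {}  # factor content -> leftmost aligned start position
        for i in range(delta, n - length + 2, delta):
            factor = w[i - 1:i - 1 + length]
            leftmost.setdefault(factor, i)
            names[(i, l)] = (l, leftmost[factor])
    return names


# The occurrence of "ab" starting at position 4 gets the name of the
# leftmost aligned occurrence of "ab", which starts at position 1.
print(name_aligned_factors("abcabcab", 1, [2])[(4, 0)])  # (0, 1)
```

Two aligned occurrences are equal as words exactly when they receive the same
name, which is the only property of names used in the sequel.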

Then, we compute all occurrences of basic factors associated with aligned
occurrences of superbasic factors. This is done again by a depth-first
top-down traversal of the marked suffix tree. During the traversal, we maintain
the same auxiliary array $\mathit{basancestor}$. Assume that during the traversal
we reach a leaf labelled by a position $q2^p\Delta + 1 + r$, where $q$ is odd and
$0 \le r < \Delta$. Then for each $l = 0, 1, \ldots, p-1$ such that $\mathit{basancestor}[l]$
is defined, we identify $w[q2^p\Delta + 1 + r \,..\, (q2^p + 2^l)\Delta + r]$ as an occurrence
of the basic factor named $(l, \mathit{basancestor}[l])$, which is $r$-associated with the
superbasic factor occurrence $w[q2^p\Delta + 1 \,..\, (q2^p + 2^{l+1})\Delta]$. Observe that this
traversal is performed in $O(n)$ time as well.

(iii) Computing lists of aligned occurrences of basic factors 

Let $y$ be a $\Delta$-periodic basic factor (cf. Introduction). Note that $y$ is also
periodic, and thus any occurrence of $y$ in $w$ is a repetition. By Proposition [H
the period $\mathit{per}(y)$ is a divisor of $p_\Delta(y)$. Given the value $p_\Delta(y)$, we can compute
in constant time the extension $r$ of any occurrence $y'$ of a $\Delta$-periodic basic
factor $y$ as follows:

\[ \mathit{beg}(r) = \mathit{beg}(y') - \mathit{LCS}(\mathit{beg}(y') - 1,\ \mathit{beg}(y') + p_\Delta(y) - 1), \]
\[ \mathit{end}(r) = \mathit{end}(y') + \mathit{LCP}(\mathit{end}(y') + 1,\ \mathit{end}(y') - p_\Delta(y) + 1). \]
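
As an illustration, the extension can be computed as in the sketch below. The
helpers `lcp` and `lcs` are naive linear-time stand-ins for the constant-time
$\mathit{LCP}$ and $\mathit{LCS}$ queries (an assumption of the sketch), and the right end of the
extension is obtained by comparing the positions following $\mathit{end}(y')$ with the
positions one period earlier. All function names are hypothetical.

```python
def lcp(w, i, j):
    """Length of the longest common prefix of w[i..n] and w[j..n] (1-based)."""
    n, k = len(w), 0
    while i + k <= n and j + k <= n and w[i + k - 1] == w[j + k - 1]:
        k += 1
    return k


def lcs(w, i, j):
    """Length of the longest common suffix of w[1..i] and w[1..j] (1-based)."""
    k = 0
    while i - k >= 1 and j - k >= 1 and w[i - k - 1] == w[j - k - 1]:
        k += 1
    return k


def extension(w, beg_y, end_y, p):
    """Extension (beg_r, end_r) of the occurrence w[beg_y..end_y] with period p."""
    beg_r = beg_y - lcs(w, beg_y - 1, beg_y + p - 1)
    end_r = end_y + lcp(w, end_y + 1, end_y - p + 1)
    return beg_r, end_r


# In "xabababy" the occurrence w[4..7] = "abab" with period 2 extends to
# the repetition w[2..7] = "ababab".
print(extension("xabababy", 4, 7, 2))  # (2, 7)
```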

Using Proposition HI it is easy to show that any set of all aligned occurrences
of $y$ having the same extension is a sequence of occurrences in which the
difference between the start positions of any two consecutive occurrences is
equal to $p_\Delta(y)$; i.e. the start positions of all these occurrences form a finite
arithmetic progression with common difference $p_\Delta(y)$. We will call these sets
runs of occurrences. The following fact can be easily proved.

Proposition 6 Let $y'$, $y''$ be two consecutive aligned occurrences of a basic
factor $y$ in $w$. Then $|\mathit{beg}(y') - \mathit{beg}(y'')| \le |y|/2$ if and only if $y$ is
$\Delta$-periodic, $y'$ and $y''$ are contained in the same run of occurrences, and,
moreover, $|\mathit{beg}(y') - \mathit{beg}(y'')| = p_\Delta(y)$.

At the next step of the algorithm, in order to select efficiently the
appropriate occurrences $y''$ in the checked triples $(z, y', y'')$, for each basic
factor $y$ we construct a linked list $\mathit{alignocc}(y)$ of all aligned occurrences of
$y$ in the left-to-right order in $w$. If $y$ is not $\Delta$-periodic, each item of
$\mathit{alignocc}(y)$ consists of only one aligned occurrence of $y$ defined, for example,
by its start position (we will call such items ordinary). If $y$ is $\Delta$-periodic,
each item of $\mathit{alignocc}(y)$ contains a run of aligned occurrences of $y$. If a run
of aligned occurrences of $y$ consists of only one occurrence, we will consider
the item of $\mathit{alignocc}(y)$ for this run as ordinary; otherwise, if a run of
aligned occurrences of $y$ consists of at least two occurrences, the item of
$\mathit{alignocc}(y)$ for this run will be defined, for example, by the start positions
of the leftmost and rightmost occurrences in the run and the value $p_\Delta(y)$
(such an item will be called a runitem). The following fact follows from
Proposition 6.

Proposition 7 Let $y'$, $y''$ be two consecutive aligned occurrences of a basic
factor $y$ in $w$. Then $|\mathit{beg}(y') - \mathit{beg}(y'')| \le |y|/2$ if and only if $y'$ and $y''$
are contained in the same runitem of $\mathit{alignocc}(y)$ and, moreover,
$|\mathit{beg}(y') - \mathit{beg}(y'')| = p_\Delta(y)$.

Proposition 7 implies that if two aligned occurrences $y'$, $y''$ of a basic
factor $y$ are contained in distinct items of $\mathit{alignocc}(y)$ then
$|\mathit{beg}(y') - \mathit{beg}(y'')| > |y|/2$. Therefore, we have the following consequence
from the proposition.

Corollary 3 Let $y$ be a basic factor of $w$. Then for any segment $v$ in $w$, the
list $\mathit{alignocc}(y)$ contains $O(|v|/|y|)$ items having at least one occurrence of $y$
contained in $v$.

To construct the lists $\mathit{alignocc}$, for each $i = 1, 2, \ldots, n$ and each
$l = 0, 1, \ldots, k-1$, we insert consecutively the occurrence
$y' = w[i \,..\, i+2^l-1]$ of some basic factor $y$ into the appropriate list
$\mathit{alignocc}(y)$ as follows. Consider the last item in the current list
$\mathit{alignocc}(y)$. Let it be an ordinary item consisting of an occurrence $y''$ of
$y$ starting at position $j$. Denote $\delta = i - j$. Consider the following two
cases for $\delta$. Let $\delta > |y|/2$. Then, by Proposition 7, $y''$ and $y'$ are
contained in distinct items of $\mathit{alignocc}(y)$, and in this case we insert $y'$
into $\mathit{alignocc}(y)$ as a new ordinary item. Now let $\delta \le |y|/2$. In this case,
by Proposition 7, $y''$ and $y'$ are the first two occurrences of the same run of
occurrences of $y$ and, moreover, $\delta = p_\Delta(y)$. Let $r$ be the extension of the
occurrences of this run. It is easy to see that

\[ \mathit{end}(r) = \mathit{end}(y') + \mathit{LCP}(\mathit{end}(y'') + 1,\ \mathit{end}(y') + 1), \]

i.e. $\mathit{end}(r)$ can be computed in constant time. From the values $\mathit{beg}(y'')$,
$\mathit{end}(r)$ and $p_\Delta(y)$ we can compute in constant time the start position of the
last occurrence of $y$ in the considered run of occurrences and thereby identify
this run completely. Thus, in this case we replace the last item of
$\mathit{alignocc}(y)$ by the identified run of occurrences of $y$. Now let the last item
in $\mathit{alignocc}(y)$ be a run of occurrences. Then, if $y'$ is not contained in this
run, we insert $y'$ into $\mathit{alignocc}(y)$ as a new ordinary item. Thus, each
occurrence of a basic factor in $w$ is processed in constant time, and the total
time for the construction of the lists $\mathit{alignocc}$ is $O(n)$.
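
The insertion procedure can be sketched as follows for a single basic factor
$y$ of length `m`, given the start positions of its aligned occurrences in
increasing order. This is an illustrative sketch only: the list is a Python
list of tuples rather than a linked list, `lcp` is a naive replacement for the
constant-time $\mathit{LCP}$ query, and the tuple layouts `("ordinary", start)` and
`("run", first, last, period)` are assumptions made here.

```python
def lcp(w, i, j):
    """Length of the longest common prefix of w[i..n] and w[j..n] (1-based)."""
    n, k = len(w), 0
    while i + k <= n and j + k <= n and w[i + k - 1] == w[j + k - 1]:
        k += 1
    return k


def build_alignocc(w, m, starts):
    """Build the item list for one factor of length m from its sorted starts."""
    items = []
    for i in starts:
        if not items:
            items.append(("ordinary", i))
            continue
        last = items[-1]
        if last[0] == "ordinary":
            j = last[1]
            delta = i - j
            if 2 * delta > m:
                # Far apart: the two occurrences lie in distinct items.
                items.append(("ordinary", i))
            else:
                # Close: first two occurrences of a run with period delta.
                end_r = (i + m - 1) + lcp(w, j + m, i + m)
                last_start = j + ((end_r - m + 1 - j) // delta) * delta
                items[-1] = ("run", j, last_start, delta)
        elif i > last[2]:
            # Not covered by the run stored in the last item.
            items.append(("ordinary", i))
    return items


print(build_alignocc("abababxabab", 4, [1, 3, 8]))
# [('run', 1, 3, 2), ('ordinary', 8)]
```

Occurrences that fall inside a run already stored in the last item are
skipped, so every occurrence is handled with a constant number of operations
apart from the `lcp` call.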

Furthermore, in order to optimize the selection of appropriate occurrences
$y''$ in the checked triples $(z, y', y'')$, for each pair $(z, y')$ where $z$ is an
aligned occurrence of a superbasic factor and $y'$ is an occurrence of some basic
factor $y$ associated with $z$, we compute a pointer $\mathit{firstocc}(z, y')$ to the
first item in $\mathit{alignocc}(y)$ containing at least one occurrence of $y$ to the
right of $z$. For this purpose, we use auxiliary lists $\mathit{factends}(i)$ defined for
each position $i$ in $w$. The lists $\mathit{factends}(i)$ consist of pairs $(z, y')$ and are
constructed at the stage of computing the occurrences associated with aligned
occurrences of superbasic factors: each time we find a new occurrence $y'$
associated with an aligned occurrence $z$ of a superbasic factor, we insert the
pair $(z, y')$ into the list $\mathit{factends}(\mathit{end}(z) + 1)$. After the construction of
the lists $\mathit{alignocc}$, we compute consecutively for each $i = 1, 2, \ldots, n$ the
pointers $\mathit{firstocc}(z, y')$ for all pairs $(z, y')$ from the list $\mathit{factends}(i)$.
During the computation, we save in each list $\mathit{alignocc}(y)$ the last item
pointed to so far (this item is denoted by $\mathit{lastpnt}(y)$). To compute
$\mathit{firstocc}(z, y')$, we go through the list $\mathit{alignocc}(y)$ from $\mathit{lastpnt}(y)$
(or from the beginning of $\mathit{alignocc}(y)$ if $\mathit{lastpnt}(y)$ does not exist) until we
find the first item containing at least one occurrence of $y$ to the right of the
position $i$. The found item is pointed to by $\mathit{firstocc}(z, y')$ and becomes the
new item $\mathit{lastpnt}(y)$. Since the total size of the lists $\mathit{alignocc}$ and
$\mathit{factends}$ is $O(n)$, the total time of computing $\mathit{firstocc}(z, y')$ is also $O(n)$.
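
The linear bound rests on the fact that the queried positions are processed in
increasing order, so the pointer into each list only moves forward. A minimal
sketch for one list follows, assuming each item is represented only by the
start position of its rightmost occurrence and that an occurrence counts as
being to the right of position `pos` when it starts at `pos` or later (both
are assumptions of the sketch, as is the function name).

```python
def first_items(item_last_starts, query_positions):
    """For nondecreasing query positions, return for each one the index of
    the first item holding an occurrence that starts at or after it
    (None if there is no such item).

    item_last_starts[t] is the start of the rightmost occurrence of item t.
    """
    result = []
    ptr = 0  # plays the role of lastpnt: it never moves backwards
    for pos in query_positions:
        while ptr < len(item_last_starts) and item_last_starts[ptr] < pos:
            ptr += 1
        result.append(ptr if ptr < len(item_last_starts) else None)
    return result


print(first_items([4, 8, 15], [1, 5, 5, 9, 16]))  # [0, 1, 1, 2, None]
```

Each iteration of the inner loop advances `ptr`, so the total work is bounded
by the number of items plus the number of queries.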

(iv) Main step: computing large repeats 

At the main stage of the algorithm, in order to process each pair $(z, y')$,
note that all occurrences $y''$ appropriate for $(z, y')$ and contained in $\mathcal{I}(z)$
are located in the fragment of $\mathit{alignocc}(y)$ consisting of all items having at
least one occurrence of $y$ contained in $\mathcal{I}(z)$. We will call this fragment the
checked fragment. Thus, we consider all items of the checked fragment by going
through this fragment from its first item, which can be found in constant
time by the value $\mathit{firstocc}(z, y')$. For each considered item, we check the
triples $(z, y', y'')$ for all occurrences $y''$ from this item as follows.

Let the considered item be an ordinary item consisting of only one
occurrence $y''$. Recall that the gapped repeat $(u', u'')$ defined by the triple
$(z, y', y'')$ can be computed in constant time by formulas (1). Thus, if
$(u', u'')$ is an $\alpha$-gapped repeat satisfying conditions (2), we output it.

Now let the item considered in the checked fragment be a runitem. This
implies that the basic factor $y$ is $\Delta$-periodic, and hence periodic. Moreover,
from the runitem we can derive the value $p_\Delta(y)$. Therefore we can compute
in constant time the extensions $r'$ and $r''$ of occurrences $y'$ and $y''$
respectively. Denote by $\rho$ the run of occurrences contained in the runitem.
Recall that our goal is to compute efficiently all $\alpha$-gapped repeats defined by
triples $(z, y', y'')$ such that $y'' \in \rho$. Note that, if $r'$ and $r''$ are the same
repetition, then by Proposition 5 all such repeats are PR-repeats; therefore we
can assume that $r'$ and $r''$ are distinct repetitions. Let $(u', u'')$ be an
$\alpha$-gapped repeat defined by a triple $(z, y', y'')$ where $y'' \in \rho$. First,
consider the case when $u'$ is not contained in $r'$, i.e. either
$\mathit{beg}(u') < \mathit{beg}(r')$ or $\mathit{end}(u') > \mathit{end}(r')$.

Proposition 8 If $\mathit{beg}(u') < \mathit{beg}(r')$, then
$\mathit{beg}(r') - \mathit{beg}(u') = \mathit{beg}(r'') - \mathit{beg}(u'')$.

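Proof. Define $\gamma' = \mathit{beg}(r') - \mathit{beg}(u')$, $\gamma'' = \mathit{beg}(r'') - \mathit{beg}(u'')$. Let
$\gamma' > \gamma''$. Then $u'[\gamma' + \mathit{per}(y)] \ne u'[\gamma'] = u''[\gamma'] = u''[\gamma' + \mathit{per}(y)]$, i.e. we
have a contradiction $u'[\gamma' + \mathit{per}(y)] \ne u''[\gamma' + \mathit{per}(y)]$. Similarly, we obtain a
contradiction $u'[\gamma'' + \mathit{per}(y)] \ne u''[\gamma'' + \mathit{per}(y)]$ in the case $\gamma' < \gamma''$. ■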

The following proposition can be proved analogously.

Proposition 9 If $\mathit{end}(u') > \mathit{end}(r')$, then
$\mathit{end}(u') - \mathit{end}(r') = \mathit{end}(u'') - \mathit{end}(r'')$.

Define

\[ s_{\mathit{left}} = \mathit{beg}(y') + (\mathit{beg}(r'') - \mathit{beg}(r')), \]
\[ s_{\mathit{right}} = \mathit{beg}(y') + (\mathit{end}(r'') - \mathit{end}(r')). \]

From Propositions 8 and 9 we derive the following fact.

Corollary 4 If $\mathit{beg}(u') < \mathit{beg}(r')$ then $\mathit{beg}(y'') = s_{\mathit{left}}$. If
$\mathit{end}(u') > \mathit{end}(r')$ then $\mathit{beg}(y'') = s_{\mathit{right}}$.

Thus, for computing $\alpha$-gapped repeats $(u', u'')$ such that $u'$ is not
contained in $r'$, it is enough to consider in $\rho$ only the occurrences
$y''_{\mathit{left}}$ and $y''_{\mathit{right}}$ with start positions $s_{\mathit{left}}$ and $s_{\mathit{right}}$
respectively, provided that these occurrences exist. We check the occurrences
$y''_{\mathit{left}}$ and $y''_{\mathit{right}}$ in the same way as we did for occurrence $y''$ in
the case of an ordinary item. Then, it remains to check all occurrences from
$\rho$ except for the possible occurrences $y''_{\mathit{left}}$ and $y''_{\mathit{right}}$. Denote by
$\rho' = \rho \setminus \{y''_{\mathit{left}}, y''_{\mathit{right}}\}$ the set of all such occurrences.
Assume that $|r'| \le |r''|$, i.e. $s_{\mathit{left}} \le s_{\mathit{right}}$ (the case $|r'| > |r''|$
is similar). In order to check all occurrences from $\rho'$, we consider the
following subsets of $\rho'$ separately: the subset $\rho'_1$ of all occurrences $y''$
such that $\mathit{beg}(y'') < s_{\mathit{left}}$, the subset $\rho'_2$ of all occurrences $y''$ such
that $s_{\mathit{left}} < \mathit{beg}(y'') < s_{\mathit{right}}$, and the subset $\rho'_3$ of all occurrences
$y''$ such that $s_{\mathit{right}} < \mathit{beg}(y'')$. Note that the start positions of all
occurrences in each of these subsets form a finite arithmetic progression
with common difference $p_\Delta(y)$. Thus, we unambiguously denote all occurrences
in each of the subsets $\rho'_i$, $i = 1, 2, 3$, by $y''_0, y''_1, \ldots, y''_k$, where
$y''_0$ is the leftmost occurrence in the subset $\rho'_i$ and
$\mathit{beg}(y''_j) = \mathit{beg}(y''_0) + j\,p_\Delta(y)$ for $j = 1, \ldots, k$. Note that the values
$\mathit{beg}(y''_0)$ and $k$ for each subset $\rho'_i$ can be computed in constant time.

First, consider an occurrence $y''_j$ from $\rho'_1$. Let $\pi = (u', u'')$ be the
repeat defined by the triple $(z, y', y''_j)$. Note that

\[ \mathit{per}(\pi) = \mathit{beg}(y''_j) - \mathit{beg}(y') = q + j\,p_\Delta(y), \qquad (5) \]

where $q = \mathit{beg}(y''_0) - \mathit{beg}(y')$. Taking into account that $y'$ and $y''_j$ are
contained in maximal repetitions $r'$ and $r''$ respectively, it is easy to verify
that

\[ \mathit{LCS}(\mathit{beg}(y') - 1,\ \mathit{beg}(y''_j) - 1) = \mathit{beg}(y''_j) - \mathit{beg}(r''), \]
\[ \mathit{LCP}(\mathit{end}(y') + 1,\ \mathit{end}(y''_j) + 1) = \mathit{end}(r') - \mathit{end}(y'). \]

Therefore, $\mathit{beg}(u') = \mathit{beg}(r'') - \mathit{per}(\pi) = q' - j\,p_\Delta(y)$, where
$q' = \mathit{beg}(r'') - q$, and $\mathit{end}(u') = \mathit{end}(r')$. It follows that

\[ c(\pi) = |u'| = \mathit{end}(u') - \mathit{beg}(u') + 1 = q'' + j\,p_\Delta(y), \]

where $q'' = \mathit{end}(r') + 1 - q'$. Recall that for any $\alpha$-gapped repeat $\pi$, we have
$c(\pi) < \mathit{per}(\pi) \le \alpha c(\pi)$. Thus, $\pi$ is an $\alpha$-gapped repeat if and only if

\[ q'' < q \le \alpha q'' + (\alpha - 1)\,j\,p_\Delta(y). \qquad (6) \]

Moreover, $u'$ has to satisfy conditions (3). Thus, the triple $(z, y', y''_j)$
defines an $\alpha$-gapped repeat if and only if conditions (6) and (3) are verified
for $j$. Note that all these conditions are linear inequalities on $j$, and thus
can be resolved in constant time. Thus, we output all $\alpha$-gapped repeats defined
by triples $(z, y', y'')$ such that $y'' \in \rho'_1$ in time $O(1 + S)$, where $S$ is the
size of the output.
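
To illustrate why such conditions are resolved in constant time, the sketch
below solves condition (6) alone for $j$, reading it as $c(\pi) < \mathit{per}(\pi) \le \alpha c(\pi)$
with $c(\pi) = q'' + j\,p_\Delta(y)$ and $\mathit{per}(\pi) = q + j\,p_\Delta(y)$. Conditions (3) are not
handled here; they would simply intersect the returned interval with further
intervals of the same kind. The function name and the use of exact rational
arithmetic for $\alpha$ are assumptions of the sketch.

```python
from fractions import Fraction
from math import ceil


def admissible_j_range(q, q2, p, k, alpha):
    """Return (j_min, j_max) such that exactly the indices j in that range,
    0 <= j <= k, satisfy  q2 + j*p < q + j*p <= alpha * (q2 + j*p),
    or None if no index qualifies.  Requires alpha > 1 and p > 0.
    """
    alpha = Fraction(alpha)
    if not q2 < q:
        # The left inequality does not depend on j.
        return None
    # q <= alpha*q2 + (alpha - 1)*j*p  <=>  j >= (q - alpha*q2) / ((alpha - 1)*p)
    j_min = max(0, ceil((q - alpha * q2) / ((alpha - 1) * p)))
    return (j_min, k) if j_min <= k else None


print(admissible_j_range(q=10, q2=4, p=2, k=5, alpha=2))  # (1, 5)
```

Once the interval is known, the repeats for $j = j_{\min}, \ldots, k$ are output one by
one, which accounts for the $O(1 + S)$ bound.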

Now consider an occurrence $y''_j$ from $\rho'_2$. Let $\pi = (u', u'')$ be the repeat
defined by the triple $(z, y', y''_j)$. Note that in this case, $\mathit{per}(\pi)$ also
satisfies relation (5). Analogously to the previous case of set $\rho'_1$, we
obtain that $\mathit{beg}(u') = \mathit{beg}(r')$ and $\mathit{end}(u') = \mathit{end}(r')$, and then
$c(\pi) = |r'|$. Therefore, $\pi$ is an $\alpha$-gapped repeat if and only if

\[ |r'| < q + j\,p_\Delta(y) \le \alpha|r'|. \qquad (7) \]

Thus, in this case, we output all $\alpha$-gapped repeats defined by triples
$(z, y', y''_j)$ such that $j$ satisfies conditions (7) and (3). Since all these
conditions can be resolved for $j$ in constant time, all these repeats can be
output in time $O(1 + S)$, where $S$ is the size of the output.

Finally, consider an occurrence $y''_j$ from $\rho'_3$. Let $\pi = (u', u'')$ be the
repeat defined by the triple $(z, y', y''_j)$. In this case, $\mathit{per}(\pi)$ also
satisfies relation (5). Analogously to the case of set $\rho'_1$, we obtain that
$\mathit{beg}(u') = \mathit{beg}(r')$ and
$\mathit{end}(u') = \mathit{end}(r'') - \mathit{per}(\pi) = q' - j\,p_\Delta(y)$, where $q' = \mathit{end}(r'') - q$, and
then

\[ c(\pi) = \mathit{end}(u') - \mathit{beg}(u') + 1 = q'' - j\,p_\Delta(y), \]

where $q'' = q' - \mathit{beg}(r') + 1$. Therefore, $\pi$ is an $\alpha$-gapped repeat if and only
if

\[ q'' - j\,p_\Delta(y) < q + j\,p_\Delta(y) \le \alpha\,(q'' - j\,p_\Delta(y)). \qquad (8) \]

Thus, in this case, we output all $\alpha$-gapped repeats defined by triples
$(z, y', y''_j)$ such that $j$ satisfies conditions (8) and (3). As in the previous
cases, this can be done in time $O(1 + S)$, where $S$ is the size of the output.

Putting together all the considered cases, we conclude that all $\alpha$-gapped
repeats defined by triples $(z, y', y'')$ such that $y'' \in \rho$ can be computed in
time $O(1 + S)$, where $S$ is the size of the output. Thus, in $O(1 + S)$ time we can
process each item of the checked fragment. Therefore, since by Corollary 3
the checked fragment has $O(\alpha)$ items, the total time for processing a pair
$(z, y')$ is $O(\alpha + S)$, where $S$ is the total number of $\alpha$-gapped repeats defined
by triples $(z, y', y'')$. Since each occurrence $z$ has no more than $\Delta$ associated
occurrences $y'$, the total number of processed pairs $(z, y')$ is $O(n)$. Thus the
time complexity of the main stage of the algorithm is $O(\alpha n + S)$, where $S$ is
the size of the output. Taking into account that $S = O(\alpha n)$ by Theorem 2,
we conclude that the time complexity of the main stage is $O(\alpha n)$. Thus,
all maximal $\alpha$-gapped non-PR repeats $\pi$ in $w$ such that $c(\pi) \ge \log n$ can be
computed in $O(\alpha n)$ time.

(v) Computing small repeats 

To compute all remaining maximal $\alpha$-gapped non-PR repeats in $w$, note that
the length of any such repeat $\pi$ is not greater than

\[ (1 + \alpha)\,c(\pi) < (1 + \log n)\log n \le 2\log^2 n. \]

Thus, setting $\Delta' = \lfloor 2\log^2 n \rfloor$, any such repeat is contained in at least one of
the segments $\mathcal{I}' = w[i\Delta' + 1 \,..\, (i + 2)\Delta']$ for $0 \le i < n/\Delta'$. Therefore, all the
remaining $\alpha$-gapped repeats can be found by searching separately in the
segments $\mathcal{I}'$. The procedure of searching for repeats in $\mathcal{I}'$ is similar to the
algorithm described above. If $\alpha > \log\log n$, searching for repeats in $\mathcal{I}'$ can be
done by the algorithm proposed in [1]. The $O(|\mathcal{I}'|\log|\mathcal{I}'| + S)$ time complexity
implied by this algorithm, where by Theorem 2 the output size $S$ is $O(\alpha|\mathcal{I}'|)$,
can be bounded here by $O(\alpha\Delta')$. Thus, the total time complexity of the search
in all segments $\mathcal{I}'$ is $O(\alpha n)$. In the case of $\alpha \le \log\log n$, we search in each
segment $\mathcal{I}'$ for all remaining maximal $\alpha$-gapped non-PR repeats $\pi$ in $w$ such
that $c(\pi) \ge \log|\mathcal{I}'|$ in time $O(\alpha\Delta')$, in the same way as we described above
for the word $w$. The total time of the search in all segments $\mathcal{I}'$ is $O(\alpha n)$.
Then, it remains to compute all maximal $\alpha$-gapped non-PR repeats $\pi$ in $w$
such that $c(\pi) < \log|\mathcal{I}'| \le 3\log\log n$. Note that the length of any such
repeat is not greater than

\[ (1 + \alpha)\,3\log\log n \le (1 + \log\log n)\,3\log\log n \le 6\log^2\log n. \]

Thus, setting $\Delta'' = \lfloor 6\log^2\log n \rfloor$, any such repeat is contained in at least one
of the segments $\mathcal{I}'' = w[i\Delta'' + 1 \,..\, (i + 2)\Delta'']$ for $0 \le i < n/\Delta''$. Note that these
segments are words of length $2\Delta''$ over an alphabet of size $\sigma$, therefore the
total number of distinct segments $\mathcal{I}''$ is not greater than

\[ \sigma^{2\Delta''} \le \sigma^{12\log^2\log n}. \]

In each of the distinct segments $\mathcal{I}''$, all maximal $\alpha$-gapped repeats can be
found by the trivial algorithm described above in $O(\Delta''^3) = O(\log^6\log n)$
time. Thus, maximal $\alpha$-gapped repeats in all distinct segments $\mathcal{I}''$ can be
found in $O(\sigma^{2\Delta''}\log^6\log n) = o(n)$ time. We conclude that all remaining
maximal $\alpha$-gapped repeats in $w$ can be found in $O(n + S)$ time, where $S$ is
the total number of maximal $\alpha$-gapped repeats contained in all segments $\mathcal{I}''$.
According to Theorem 2, this number can be bounded by $O(\alpha n)$, and the
time for finding all the remaining maximal $\alpha$-gapped repeats can be bounded
by $O(\alpha n)$ as well. This leads to the final result.

Theorem 4 For a fixed $\alpha > 1$, all maximal $\alpha$-gapped repeats in a word of
length $n$ over a constant alphabet can be found in $O(\alpha n)$ time.

Note that since, as mentioned earlier, a word can contain $\Theta(\alpha n)$ maximal
$\alpha$-gapped repeats, the $O(\alpha n)$ time bound stated in Theorem 4 is
asymptotically optimal.
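
For checking an implementation on short words, a brute-force enumeration of
maximal $\alpha$-gapped repeats is convenient. The sketch below is such a reference
procedure with cubic worst-case time, not the algorithm of Theorem 4. It
assumes the following reading of the definitions: a repeat is a pair of equal
factors starting at positions $i < j$ with $c(\pi)$ their common length and
$\mathit{per}(\pi) = j - i$; it is gapped when $c(\pi) < \mathit{per}(\pi)$, $\alpha$-gapped when
$\mathit{per}(\pi) \le \alpha c(\pi)$, and maximal when the copies can be extended neither to
the left nor to the right. The function name and output format are
hypothetical.

```python
from fractions import Fraction


def maximal_gapped_repeats(w, alpha):
    """Return all triples (i, j, c), 1-based, describing maximal alpha-gapped
    repeats with copies w[i..i+c-1] and w[j..j+c-1]."""
    alpha = Fraction(alpha)
    n = len(w)
    found = []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if i > 1 and w[i - 2] == w[j - 2]:
                continue  # both copies could be extended to the left
            c = 0
            while j + c <= n and w[i + c - 1] == w[j + c - 1]:
                c += 1  # extend both copies to the right as far as possible
            per = j - i
            if 0 < c < per and per <= alpha * c:
                found.append((i, j, c))
    return found


print(maximal_gapped_repeats("abcab", 2))  # [(1, 4, 2)]
```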


5 Conclusions 

Besides gapped repeats we can also consider gapped palindromes, which are
factors of the form $uvu^R$ where $u$ and $v$ are nonempty words and $u^R$ is the
reversal of $u$ im. A gapped palindrome $uvu^R$ in a word $w$ is called maximal
if $w[\mathit{end}(u) + 1] \ne w[\mathit{beg}(u^R) - 1]$ and $w[\mathit{beg}(u) - 1] \ne w[\mathit{end}(u^R) + 1]$ for
$\mathit{beg}(u) > 1$ and $\mathit{end}(u^R) < |w|$. A maximal gapped palindrome $uvu^R$ is
$\alpha$-gapped if $|u| + |v| \le \alpha|u|$ [13]. It can be shown analogously to the results
of this paper that for $\alpha > 1$ the number of maximal $\alpha$-gapped palindromes in a
word of length $n$ is bounded by $O(\alpha n)$ and, for the case of a constant alphabet,
all these palindromes can be found in $O(\alpha n)$ time^.

^Note that in m, the number of maximal $\alpha$-gapped palindromes was conjectured to
be $O(\alpha^2 n)$.

In this paper we consider maximal $\alpha$-gapped repeats with $\alpha > 1$. However,
this notion can be formally generalized to the case of $\alpha \le 1$. In particular,
maximal 1-gapped repeats are maximal repeats whose copies are adjacent
or overlapping. It is easy to see that such repeats form runs whose minimal
periods are divisors of the periods of these repeats. Moreover, each run in
a word is formed by at least one maximal 1-gapped repeat, therefore the
number of runs in a word is not greater than the number of maximal
1-gapped repeats. More precisely, each run $r$ is formed by $\lfloor \exp(r)/2 \rfloor$ distinct
maximal 1-gapped repeats. Thus, if a word contains runs with exponent
greater than or equal to 4, then the number of maximal 1-gapped repeats is
strictly greater than the number of runs. However, using an easy modification
of the proof of the “runs conjecture” from [2], it can also be proved that the
number of maximal 1-gapped repeats in a word is strictly less than the length
of the word. Moreover, denoting by $\mathcal{R}(n)$ (respectively, $\mathcal{R}_1(n)$) the maximal
possible number of runs (respectively, maximal possible number of maximal
1-gapped repeats) in words of length $n$, we conjecture that $\mathcal{R}(n) = \mathcal{R}_1(n)$,
since known words with a relatively large number of runs have no runs with
big exponents. We can also consider the case of $\alpha < 1$ for repeats with
overlapping copies, in particular, the case of maximal $1/k$-gapped repeats
where $k$ is an integer greater than 1. It is easy to see that such repeats form
runs with exponents greater than or equal to $k + 1$. It is known from [2,
Theorem 11] that the number of such runs in a word of length $n$ is less than
$n/k$, and it seems to be possible to modify the proof of this fact for proving
that the number of maximal $1/k$-gapped repeats in the word is also less than
$n/k = \alpha n$. These observations, together with results of computer experiments
for the case of $\alpha > 1$, lead to the conjecture that for any $\alpha > 0$, the number
of maximal $\alpha$-gapped repeats in a word of length $n$ is actually less than $\alpha n$.
This generalization of the “runs conjecture” constitutes an interesting open
problem. Another interesting open question is whether the obtained $O(n/\delta)$
bound on the number of maximal $\delta$-subrepetitions is asymptotically tight for
the case of a constant alphabet.